You are allowed and encouraged to work with one partner on this project. Include your names, perm numbers, and whether you are taking the class for 131 or 231 credit.
You are welcome and encouraged to write up your report as a research paper (e.g. abstract, introduction, methods, results, conclusion) as long as you address each of the questions below. Alternatively, you can format the assignment like a long homework by addressing each question in parts.
There should be no raw R output in the body of the paper! All of your results should be formatted in a professional and visually appealing manner. That means, either as a polished visualization or, for tabular data, a nicely formatted table (see the documentation for the kable and kableExtra packages). If you feel you must include extensive raw R output, this should be included in an appendix, not the main report.
All R code should be available from your Rmarkdown file, but does not need to be shown in the body of the report! Use the chunk option echo=FALSE to exclude code from appearing in your writeup. In addition to your Rmarkdown file, you are required to submit the writeup as either a pdf document or an html file (both are acceptable).
The presidential election in 2012 did not come as a surprise. Some, including Nate Silver, correctly predicted the outcome of the election, and many speculated about his approach.
Despite the success in 2012, the 2016 presidential election came as a big surprise to many, and it was a clear example that even the current state-of-the-art technology can surprise us. Predicting voter behavior is complicated for many reasons despite the tremendous effort in collecting, analyzing, and understanding many available datasets. For our final project, we will analyze the 2016 presidential election dataset.
Answer the following questions in one paragraph for each.
What makes voter behavior prediction (and thus election forecasting) a hard problem? Voter behavior prediction is difficult because there are many unknown sources of error embedded in the surveys that are released to the public in advance of the election. One of the most prominent issues is non-response bias: the individuals who actually respond to a survey may not represent the distribution of candidate preferences across the population. For example, if Hillary supporters are the primary individuals responding to a survey, the results may suggest a very strong chance of Hillary winning the election when in actuality there are many Trump supporters who simply do not care to respond to online surveys. Efforts can be made to minimize this unpredictability by distributing surveys across a variety of channels and by accounting for non-response bias in forecasting formulas, but it remains difficult to quantify. Unpredictability in survey results is further exacerbated if survey respondents do not actually make their way to the polls while non-respondents turn out to vote on Election Day. Furthermore, even if poll responses are fairly uniform, voters can change their minds at a moment's notice for uncontrollable reasons, which makes forecasts very susceptible to error.
What was unique to Nate Silver’s approach in 2012 that allowed him to achieve good predictions? Nate Silver used a hierarchical modeling approach that generates a time series of state-level voter preferences and accounts for key variables in the state-level voting population (wealth, race, etc.). Incorporating time-varying shifts that favor one candidate over another, such as a drop in unemployment at the state level or an increase in national taxes, can further improve the estimation of voter behavior over time. Silver’s application of hierarchical modeling is particularly powerful for accounting for how state and national polls influence each other, especially when some state polls occur before others: the hierarchical framework allows information to propagate through the model over time to inform new predictions. His approach is also notably advanced because he calculated the full distribution of probabilities for a candidate winning at the state level on a given day rather than just the maximum-probability outcome. Combining this with the hierarchical model and simulating predictions forward, weighted by the probability that the starting point was correct, allowed for particularly high accuracy. Ultimately, the wealth of polling data leading up to the 2012 election was also essential to the accuracy of Silver’s prediction strategy.
What went wrong in 2016? What do you think should be done to make future predictions better? A key issue in the 2016 election predictions seems to be the presence of systematic polling error. Errors in individual-level polls accrued when aggregated to the state level and came to misrepresent predicted outcomes to an even greater extent in national-level predictions. Polling methods (live versus recorded voice surveys) showed differing degrees of support for candidates; in the 2016 election, many Trump supporters may have been less willing to voice their preference in a poll, especially to a live voice. This variation in respondent behavior could lead to misleading poll data, which caused Nate Silver’s predictions to suffer. Furthermore, the lackluster turnout of both Democratic poll respondents and voters on Election Day, particularly in the Midwest, also skewed predictions. To improve future election predictions, forecasting methods will rely on more robust poll data, similar to what was available leading up to the 2012 election, to provide an accurate basis for predictions. With sparse data that also bears an underlying bias, pollsters will struggle to make accurate predictions. It is unclear whether respondents will take polls more or less seriously in the aftermath of the 2016 election, and pollsters will also need to account for imbalanced shifts in the behavior of poll respondents with different preferences. Future polling results may also gain reliability by disclosing greater uncertainty in their reports. While this runs counter to polls trying to claim the highest accuracy possible, acknowledging the inherent uncertainty could give polling results a less severe impact on voter behavior.
election.raw = read.csv("data/election/election.csv") %>% as.tbl
census_meta = read.csv("data/census/metadata.csv", sep = ";") %>% as.tbl
census = read.csv("data/census/census.csv") %>% as.tbl
The meaning of each column in election.raw is clear except fips. The acronym is short for Federal Information Processing Standard.
In our dataset, fips values denote the area (US, state, or county) that each row of data represents. For example, a fips value of 6037 denotes Los Angeles County.
| county | fips | candidate | state | votes |
|---|---|---|---|---|
| Los Angeles County | 6037 | Hillary Clinton | CA | 2464364 |
| Los Angeles County | 6037 | Donald Trump | CA | 769743 |
| Los Angeles County | 6037 | Gary Johnson | CA | 88968 |
| Los Angeles County | 6037 | Jill Stein | CA | 76465 |
| Los Angeles County | 6037 | Gloria La Riva | CA | 21993 |
Some rows in election.raw are summary rows and these rows have county value of NA. There are two kinds of summary rows:
Federal-level summary rows have a fips value of US, and state-level summary rows have a state-level fips value. Remove observations with fips = 2000 from election.raw and provide a reason for excluding them. Please make sure to use the same name election.raw before and after removing those observations. After removing rows with fips = 2000, the election.raw table has 18345 observations and 5 variables. Alaska has a fips value of 2000, so the rows where fips = 2000 are indeed a state-level summary of election results. However, the state-level summary rows for Alaska are already available when we read the data, so it makes no sense to keep duplicate records.
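The removal step can be sketched as follows; this is a minimal example on a hypothetical toy version of `election.raw` (the values shown are placeholders, not the real data), assuming `dplyr` is loaded as elsewhere in the report:

```r
library(dplyr)

# Hypothetical stand-in for a few rows of election.raw (toy values)
election.raw <- tibble(
  county    = c(NA, "Los Angeles County", "Autauga County"),
  fips      = c("2000", "6037", "1001"),
  candidate = c("Donald Trump", "Hillary Clinton", "Donald Trump"),
  state     = c("AK", "CA", "AL"),
  votes     = c(100, 200, 300)
)

# Drop the duplicated Alaska summary rows, reusing the same name
election.raw <- election.raw %>% filter(fips != "2000")
```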
Following is the first few rows of the census data:
| State | County | TotalPop | Men | Women | Hispanic | White | Black | Native | Asian | Pacific | Citizen | Income | IncomeErr | IncomePerCap | IncomePerCapErr | Poverty | ChildPoverty | Professional | Service | Office | Construction | Production | Drive | Carpool | Transit | Walk | OtherTransp | WorkAtHome | MeanCommute | Employed | PrivateWork | PublicWork | SelfEmployed | FamilyWork | Unemployment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Alabama | Autauga | 1948 | 940 | 1008 | 0.9 | 87.4 | 7.7 | 0.3 | 0.6 | 0.0 | 1503 | 61838 | 11900 | 25713 | 4548 | 8.1 | 8.4 | 34.7 | 17.0 | 21.3 | 11.9 | 15.2 | 90.2 | 4.8 | 0 | 0.5 | 2.3 | 2.1 | 25.0 | 943 | 77.1 | 18.3 | 4.6 | 0 | 5.4 |
| Alabama | Autauga | 2156 | 1059 | 1097 | 0.8 | 40.4 | 53.3 | 0.0 | 2.3 | 0.0 | 1662 | 32303 | 13538 | 18021 | 2474 | 25.5 | 40.3 | 22.3 | 24.7 | 21.5 | 9.4 | 22.0 | 86.3 | 13.1 | 0 | 0.0 | 0.7 | 0.0 | 23.4 | 753 | 77.0 | 16.9 | 6.1 | 0 | 13.3 |
| Alabama | Autauga | 2968 | 1364 | 1604 | 0.0 | 74.5 | 18.6 | 0.5 | 1.4 | 0.3 | 2335 | 44922 | 5629 | 20689 | 2817 | 12.7 | 19.7 | 31.4 | 24.9 | 22.1 | 9.2 | 12.4 | 94.8 | 2.8 | 0 | 0.0 | 0.0 | 2.5 | 19.6 | 1373 | 64.1 | 23.6 | 12.3 | 0 | 6.2 |
| Alabama | Autauga | 4423 | 2172 | 2251 | 10.5 | 82.8 | 3.7 | 1.6 | 0.0 | 0.0 | 3306 | 54329 | 7003 | 24125 | 2870 | 2.1 | 1.6 | 27.0 | 20.8 | 27.0 | 8.7 | 16.4 | 86.6 | 9.1 | 0 | 0.0 | 2.6 | 1.6 | 25.3 | 1782 | 75.7 | 21.2 | 3.1 | 0 | 10.8 |
| Alabama | Autauga | 10763 | 4922 | 5841 | 0.7 | 68.5 | 24.8 | 0.0 | 3.8 | 0.0 | 7666 | 51965 | 6935 | 27526 | 2813 | 11.4 | 17.5 | 49.6 | 14.2 | 18.2 | 2.1 | 15.8 | 88.0 | 10.5 | 0 | 0.0 | 0.6 | 0.9 | 24.8 | 5037 | 67.1 | 27.6 | 5.3 | 0 | 4.2 |
| Alabama | Autauga | 3851 | 1787 | 2064 | 13.1 | 72.9 | 11.9 | 0.0 | 0.0 | 0.0 | 2642 | 63092 | 9585 | 30480 | 7550 | 14.4 | 21.9 | 24.2 | 17.5 | 35.4 | 7.9 | 14.9 | 82.7 | 6.9 | 0 | 0.0 | 6.0 | 4.5 | 19.8 | 1560 | 79.4 | 14.7 | 5.8 | 0 | 10.9 |
Column information is given in the metadata file.
Move summary rows from election.raw data into federal or state level summary files: i.e.,
Federal-level summary into election_federal.
State-level summary into election_state.
Only county-level data is to remain in election.
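The three-way split described above might look like the following sketch, again on a hypothetical toy `election.raw` with one row of each kind (placeholder values), assuming `dplyr`:

```r
library(dplyr)

# Toy stand-in for election.raw (hypothetical values)
election.raw <- tibble(
  county    = c(NA, NA, "Autauga County"),
  fips      = c("US", "AL", "1001"),
  candidate = "Donald Trump",
  state     = c("US", "AL", "AL"),
  votes     = c(900, 400, 300)
)

election_federal <- election.raw %>% filter(fips == "US")                 # federal summary
election_state   <- election.raw %>% filter(fips != "US", is.na(county))  # state summaries
election         <- election.raw %>% filter(!is.na(county))               # county-level rows
```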
How many named presidential candidates were there in the 2016 election? Draw a bar chart of all votes received by each candidate. You can split this into multiple plots or may prefer to plot the results on the log scale. Either way, the results should be clear and legible!
The 2016 election had a total of 31 candidates with a 32nd category for additional candidates that collected a total of 28,863 votes across the US.
Create variables county_winner and state_winner by taking the candidate with the highest proportion of votes. Hint: to create county_winner, start with election, group by fips, compute total votes, and pct = votes/total. Then choose the highest row using top_n (the variable state_winner is similar).
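The hinted pipeline for county_winner can be sketched on toy vote counts (hypothetical numbers, `dplyr` assumed):

```r
library(dplyr)

# Toy county-level results (hypothetical values)
election <- tibble(
  fips      = c("1001", "1001", "1003", "1003"),
  candidate = c("Donald Trump", "Hillary Clinton",
                "Donald Trump", "Hillary Clinton"),
  votes     = c(300, 100, 150, 250)
)

county_winner <- election %>%
  group_by(fips) %>%
  mutate(total = sum(votes), pct = votes / total) %>%
  top_n(1, pct)   # keep the highest-share candidate per county
```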
Census data. Many exit polls noted that demographics played a big role in the election. Use this Washington Post article and this R graph gallery for ideas and inspiration. The census data contains high-resolution information (more fine-grained than county-level).
In this problem, we aggregate the information into county-level data by computing the TotalPop-weighted average of each attribute for each county. Create the following variables:
Clean census data census.del: start with census, filter out any rows with missing values, convert {Men, Employed, Citizen} attributes to percentages (meta data seems to be inaccurate), compute Minority attribute by combining {Hispanic, Black, Native, Asian, Pacific}, remove these variables after creating Minority, remove {Walk, PublicWork, Construction}.
Many columns are related and, if a set of them adds up to 100%, one column can be deleted. For example, Men and Women together comprise 100% of TotalPop, so we only need two of the three counts to determine the third and can choose one to delete.
Sub-county census data, census.subct: start with census.del from above, group_by() two attributes {State, County}, use add_tally() to compute CountyTotal. Also, compute the weight by TotalPop/CountyTotal.
County census data, census.ct: start with census.subct, use summarize_at() to compute the weighted sum.
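The sub-county-to-county aggregation described above can be sketched like this, on a hypothetical two-tract county with placeholder numbers (the real pipeline applies summarize_at across all the retained attributes):

```r
library(dplyr)

# Two hypothetical census tracts in one county (toy values)
census.del <- tibble(
  State    = "Alabama", County = "Autauga",
  TotalPop = c(100, 300), Poverty = c(10, 20)
)

census.subct <- census.del %>%
  group_by(State, County) %>%
  add_tally(TotalPop, name = "CountyTotal") %>%   # county population total
  mutate(weight = TotalPop / CountyTotal)         # tract's share of county pop

# Population-weighted county average of each attribute
census.ct <- census.subct %>%
  summarize_at(vars(Poverty), ~ sum(. * weight))
```

Here the weighted county poverty rate is 0.25 × 10 + 0.75 × 20 = 17.5.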
* _Print the first few rows of `census.ct`_:
| State | County | CountyPop | Men | White | Citizen | Income | IncomeErr | IncomePerCap | IncomePerCapErr | Poverty | ChildPoverty | Professional | Service | Office | Production | Drive | Carpool | Transit | OtherTransp | WorkAtHome | MeanCommute | Employed | PrivateWork | SelfEmployed | FamilyWork | Unemployment | Minority | weight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Alabama | Autauga | 6486.404 | 0.4843266 | 75.78823 | 0.7374912 | 51696.29 | 7771.009 | 24974.50 | 3433.674 | 12.91231 | 18.70758 | 32.79097 | 17.17044 | 24.28243 | 17.15713 | 87.50624 | 8.781235 | 0.0952590 | 1.3059687 | 1.8356531 | 26.50016 | 0.4343637 | 73.73649 | 5.433254 | 0.0000000 | 7.733726 | 22.53687 | 1 |
| Alabama | Baldwin | 7698.957 | 0.4884866 | 83.10262 | 0.7569406 | 51074.36 | 8745.050 | 27316.84 | 3803.718 | 13.42423 | 19.48431 | 32.72994 | 17.95092 | 27.10439 | 11.32186 | 84.59861 | 8.959078 | 0.1266209 | 1.4438000 | 3.8504774 | 26.32218 | 0.4405113 | 81.28266 | 5.909353 | 0.3633269 | 7.589820 | 15.21426 | 1 |
| Alabama | Barbour | 3325.195 | 0.5382816 | 46.23159 | 0.7691222 | 32959.30 | 6031.065 | 16824.22 | 2430.189 | 26.50563 | 43.55962 | 26.12404 | 16.46343 | 23.27878 | 23.31741 | 83.33021 | 11.056609 | 0.4954032 | 1.6217251 | 1.5019456 | 24.51828 | 0.3192113 | 71.59426 | 7.149837 | 0.0897742 | 17.525557 | 51.94382 | 1 |
| Alabama | Bibb | 6380.718 | 0.5341090 | 74.49989 | 0.7739781 | 38886.63 | 5662.358 | 18430.99 | 3073.599 | 16.60375 | 27.19708 | 21.59010 | 17.95545 | 17.46731 | 23.74415 | 83.43488 | 13.153641 | 0.5031366 | 1.5620952 | 0.7314679 | 28.71439 | 0.3669262 | 76.74385 | 6.637936 | 0.3941515 | 8.163104 | 24.16597 | 1 |
| Alabama | Blount | 7018.573 | 0.4940565 | 87.85385 | 0.7337550 | 46237.97 | 8695.786 | 20532.27 | 2052.055 | 16.72152 | 26.85738 | 28.52930 | 13.94252 | 23.83692 | 20.10413 | 84.85031 | 11.279222 | 0.3626321 | 0.4199411 | 2.2654133 | 34.84489 | 0.3844914 | 81.82671 | 4.228716 | 0.3564928 | 7.699640 | 10.59474 | 1 |
| Alabama | Bullock | 4263.211 | 0.5300618 | 22.19918 | 0.7545420 | 33292.69 | 9000.345 | 17579.57 | 3110.645 | 24.50260 | 37.29116 | 19.55253 | 14.92420 | 20.17051 | 25.73547 | 74.77277 | 14.839127 | 0.7732160 | 1.8238247 | 3.0998783 | 28.63106 | 0.3619592 | 79.09065 | 5.273684 | 0.0000000 | 17.890026 | 76.53587 | 1 |
Run PCA and save the principal components ct.pc and subct.pc, for county and sub-county respectively. Discuss whether you chose to center and scale the features before running PCA and the reasons for your choice. What are the three features with the largest absolute values of the first principal component? Which features have opposite signs and what does that mean about the correlation between these features? For this PCA at both the sub-county and county levels, we opted to center and scale the features before running PCA because the magnitudes of the different covariates used to generate the principal components vary considerably. For example, the TotalPop value is inherently on a different order of magnitude than the documented percentage values. Without this scaling, most of the principal components would be driven by the TotalPop variable, which has the highest mean and variance of all the variables. We also wanted to center the data so that the effects of different principal components could be compared from a common mean of zero.
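As a minimal illustration of this choice (using the built-in USArrests data as a stand-in for the census features, not the project data):

```r
# prcomp centers by default; scale. = TRUE standardizes each feature
pr <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Features ordered by the absolute value of their PC1 loading
sort(abs(pr$rotation[, 1]), decreasing = TRUE)
```

Without `scale. = TRUE`, the feature with the largest raw variance would dominate the first component, which is exactly the TotalPop problem described above.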
kable(subct.pc[order(abs(subct.pr$rotation[,1]), decreasing=TRUE), ], col.names = c("PC1 Ordered", "PC2"), caption = "Subcounty PCA Loadings") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE) %>% scroll_box(width = "100%")
| PC1 Ordered | PC2 |
|---|---|
| 1.8799435 | -1.7903730 |
| -0.2223740 | -1.2470990 |
| -0.2051629 | -2.0079701 |
| -0.7511110 | -1.5172255 |
| -0.3704778 | -1.8203649 |
| 1.2713954 | -0.7739458 |
| 0.2448866 | -1.5654512 |
| -0.3158999 | -1.3582036 |
| -0.2987879 | -2.0226222 |
| -1.6035325 | -0.7210582 |
| -1.3703754 | -1.0523781 |
| 1.6308928 | -1.9615056 |
| -0.1688217 | -0.6240623 |
| -2.7141429 | -1.3163688 |
| -0.6111902 | -0.9574739 |
| -0.1616315 | -1.7190563 |
| -0.9059190 | -1.4085426 |
| -0.2913751 | -0.6116355 |
| -0.0520694 | -1.9993488 |
| 2.7212503 | -0.9658149 |
| -1.1821343 | -0.9858235 |
| -1.3663013 | -1.3485339 |
| 0.2550698 | -1.4874826 |
| 2.5407220 | -1.0203307 |
| 0.1397307 | -1.6862941 |
| 0.0618780 | -1.6162404 |
| 1.3548816 | -1.8814244 |
| -2.6777629 | -0.7000684 |
kable(subct.pc[order(abs(subct.pr$rotation[,2]), decreasing=TRUE), ], col.names = c("PC1", "PC2 Ordered"), caption = "Subcounty PCA Loadings") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE) %>% scroll_box(width = "100%")
| PC1 | PC2 Ordered |
|---|---|
| -0.0520694 | -1.9993488 |
| -0.9059190 | -1.4085426 |
| -0.2987879 | -2.0226222 |
| -0.3158999 | -1.3582036 |
| 0.2550698 | -1.4874826 |
| -2.6777629 | -0.7000684 |
| -0.1616315 | -1.7190563 |
| 1.3548816 | -1.8814244 |
| -0.1688217 | -0.6240623 |
| -1.3703754 | -1.0523781 |
| 1.6308928 | -1.9615056 |
| 1.8799435 | -1.7903730 |
| -0.7511110 | -1.5172255 |
| -0.2223740 | -1.2470990 |
| 2.7212503 | -0.9658149 |
| -2.7141429 | -1.3163688 |
| 0.2448866 | -1.5654512 |
| -1.6035325 | -0.7210582 |
| 1.2713954 | -0.7739458 |
| -1.1821343 | -0.9858235 |
| -0.2051629 | -2.0079701 |
| 2.5407220 | -1.0203307 |
| 0.1397307 | -1.6862941 |
| 0.0618780 | -1.6162404 |
| -0.6111902 | -0.9574739 |
| -0.3704778 | -1.8203649 |
| -1.3663013 | -1.3485339 |
| -0.2913751 | -0.6116355 |
kable(ct.pc[order(abs(ct.pr$rotation[,1]), decreasing=TRUE), ], col.names = c("PC1 Ordered", "PC2"), caption = "County PCA Loadings") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE) %>% scroll_box(width = "100%")
| PC1 Ordered | PC2 |
|---|---|
| 2.9949113 | -0.3651833 |
| 0.8509903 | 0.6892077 |
| 2.1312585 | -2.0692849 |
| 0.7364145 | 0.1924133 |
| 0.7017505 | -2.1698401 |
| 1.3278622 | -0.6520276 |
| 0.9449785 | -1.6447003 |
| -0.2389475 | -1.9785152 |
| 4.1988349 | 0.0683149 |
| 1.7259839 | -1.0476904 |
| 3.0289534 | -0.7681529 |
| 3.4814146 | -0.4731179 |
| 4.0378067 | -0.9768704 |
| 1.5984119 | -0.2959302 |
| 1.0513009 | -0.6938956 |
| 0.6940356 | -0.0832079 |
| 0.2403462 | -0.4233737 |
| 0.8703187 | -1.5079796 |
| 1.5610517 | 0.0135212 |
| 0.5866700 | -0.6149344 |
| 4.9462389 | -0.5573296 |
| -0.3501971 | -2.0554138 |
| 3.3324835 | -1.9561621 |
| 5.3731041 | -0.2952842 |
| -1.0421788 | -2.3453701 |
| 1.9171297 | -0.1546236 |
| 3.4668768 | -0.4790275 |
kable(ct.pc[order(abs(ct.pr$rotation[,2]), decreasing=TRUE), ], col.names = c("PC1", "PC2 Ordered"), caption = "County PCA Loadings") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE) %>% scroll_box(width = "100%")
| PC1 | PC2 Ordered |
|---|---|
| 1.0513009 | -0.6938956 |
| 0.5866700 | -0.6149344 |
| -0.3501971 | -2.0554138 |
| 3.4814146 | -0.4731179 |
| 3.3324835 | -1.9561621 |
| 4.9462389 | -0.5573296 |
| 4.0378067 | -0.9768704 |
| 1.5610517 | 0.0135212 |
| 0.6940356 | -0.0832079 |
| 0.7017505 | -2.1698401 |
| 1.9171297 | -0.1546236 |
| -1.0421788 | -2.3453701 |
| 0.8703187 | -1.5079796 |
| 1.7259839 | -1.0476904 |
| 4.1988349 | 0.0683149 |
| -0.2389475 | -1.9785152 |
| 2.9949113 | -0.3651833 |
| 1.3278622 | -0.6520276 |
| 0.2403462 | -0.4233737 |
| 5.3731041 | -0.2952842 |
| 0.7364145 | 0.1924133 |
| 2.1312585 | -2.0692849 |
| 3.0289534 | -0.7681529 |
| 0.8509903 | 0.6892077 |
| 0.9449785 | -1.6447003 |
| 3.4668768 | -0.4790275 |
| 1.5984119 | -0.2959302 |
After running the PCA, the three features that have the largest magnitude for PC1 at the subcounty level are IncomePerCap, Professional, and Poverty covariates. The top three covariates with the largest magnitude of PC1 at the county level are IncomePerCap, ChildPoverty, and Poverty.
| PC1 | Feature | Administrative Level |
|---|---|---|
| 0.3181199 | IncomePerCap | Subcounty |
| 0.3064366 | Professional | Subcounty |
| 0.3046886 | Poverty | Subcounty |
| 0.3506105 | IncomePerCap | County |
| 0.3432571 | ChildPoverty | County |
| 0.3420707 | Poverty | County |
| PC1 | PC2 | Feature |
|---|---|---|
| -0.0324 | -0.0324 | SubcountyPop |
| -0.0173 | -0.0173 | Men |
| -0.2404 | -0.2404 | White |
| -0.1608 | -0.1608 | Citizen |
| -0.3025 | -0.3025 | Income |
| -0.1989 | -0.1989 | IncomeErr |
| -0.3181 | -0.3181 | IncomePerCap |
| -0.2123 | -0.2123 | IncomePerCapErr |
| 0.3047 | 0.3047 | Poverty |
| 0.2979 | 0.2979 | ChildPoverty |
| -0.3064 | -0.3064 | Professional |
| 0.2688 | 0.2688 | Service |
| 0.0138 | 0.0138 | Office |
| 0.2068 | 0.2068 | Production |
| -0.0789 | -0.0789 | Drive |
| 0.1626 | 0.1626 | Carpool |
| 0.0573 | 0.0573 | Transit |
| 0.0451 | 0.0451 | OtherTransp |
| -0.1730 | -0.1730 | WorkAtHome |
| -0.0100 | -0.0100 | MeanCommute |
| -0.2212 | -0.2212 | Employed |
| 0.0420 | 0.0420 | PrivateWork |
| -0.0697 | -0.0697 | SelfEmployed |
| -0.0152 | -0.0152 | FamilyWork |
| 0.2528 | 0.2528 | Unemployment |
| 0.2420 | 0.2420 | Minority |
| 0.0215 | 0.0215 | CountyTotal |
| 0.0119 | 0.0119 | weight |
| PC1 | PC2 | Feature |
|---|---|---|
| 0.0243 | -0.3428 | CountyPop |
| -0.0076 | 0.1506 | Men |
| -0.2242 | 0.1262 | White |
| -0.0064 | 0.1631 | Citizen |
| -0.3181 | -0.2069 | Income |
| -0.1679 | -0.2558 | IncomeErr |
| -0.3506 | -0.1211 | IncomePerCap |
| -0.1937 | -0.1307 | IncomePerCapErr |
| 0.3421 | 0.0267 | Poverty |
| 0.3433 | 0.0238 | ChildPoverty |
| -0.2497 | -0.0126 | Professional |
| 0.1817 | 0.0259 | Service |
| 0.0174 | -0.2838 | Office |
| 0.1184 | -0.0090 | Production |
| 0.0956 | -0.2167 | Drive |
| 0.0766 | 0.0488 | Carpool |
| -0.0698 | -0.1418 | Transit |
| 0.0097 | 0.0425 | OtherTransp |
| -0.1770 | 0.3169 | WorkAtHome |
| 0.0606 | -0.2349 | MeanCommute |
| -0.3277 | -0.0314 | Employed |
| -0.0540 | -0.3435 | PrivateWork |
| -0.1008 | 0.3928 | SelfEmployed |
| -0.0505 | 0.2643 | FamilyWork |
| 0.2914 | -0.0938 | Unemployment |
| 0.2276 | -0.1251 | Minority |
| 0.0054 | 0.0106 | weight |
At the county level, at least 14 principal components are required to capture 90% of the variance. At the subcounty level, 17 principal components are needed to represent 90% of the variance in the data.
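The component counts above come from the cumulative proportion of variance explained, which can be sketched as follows (USArrests again as a stand-in dataset):

```r
pr  <- prcomp(USArrests, scale. = TRUE)
pve <- pr$sdev^2 / sum(pr$sdev^2)        # proportion of variance explained
n90 <- min(which(cumsum(pve) >= 0.90))   # components needed to reach 90%
```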
Using census.ct, perform hierarchical clustering with complete linkage. Cut the tree to partition the observations into 10 clusters. Re-run the hierarchical clustering algorithm using the first 5 principal components of ct.pc as inputs instead of the original features. Compare and contrast the results. For both approaches investigate the cluster that contains San Mateo County. Which approach seemed to put San Mateo County in a more appropriate cluster? Comment on what you observe and discuss possible explanations for these observations. San Mateo County is found in clusters 1 and 1 for the hierarchical clustering using the census data and the first five PCs, respectively. Based on the two visualizations of each clustering approach, one can start to evaluate which clustering approach is qualitatively superior. The San Mateo cluster generated using the census data (cluster 2, orange triangle) exhibits considerable overlap with other clusters, which is indicative of poor clustering. This approach also included 379 counties in the San Mateo cluster. By comparison, the clustering based on the principal components seems to do a better job of isolating the San Mateo County cluster (cluster 5, green diamond), and this approach only included 112 counties in the cluster with San Mateo. This decrease in cluster size likely helped reduce variation.
This qualitative analysis is further supported by a quantitative measure of within-cluster variation, which is smaller for the PC-based clustering than for the raw-census-data clustering (3.8880276 × 10^8 versus 2.8345758 × 10^8).
Based on these results, the hierarchical clustering approach that utilized the first five PC generated from PCA performed on the census data seemed to produce a more appropriate cluster for San Mateo County as opposed to hierarchical clustering based on the raw census data.
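The two clustering runs can be sketched as follows (USArrests as a stand-in; the report's version uses census.ct and the first five columns of ct.pc, and USArrests has only four features, so two components are used here):

```r
x <- scale(USArrests)

# Complete-linkage clustering on the raw (scaled) features
hc.raw <- hclust(dist(x), method = "complete")
cl.raw <- cutree(hc.raw, k = 10)

# Same procedure on the first few principal components
pc    <- prcomp(x)$x[, 1:2]
hc.pc <- hclust(dist(pc), method = "complete")
cl.pc <- cutree(hc.pc, k = 10)
```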
In order to train classification models, we need to combine county_winner and census.ct data. This seemingly straightforward task is harder than it sounds. The following code makes the necessary changes to merge them into election.cl for classification.
Using the following code, partition data into 80% training and 20% testing:
Using the following code, define 10 cross-validation folds:
Using the following error rate function:
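The error-rate helper referenced here (its code is hidden by echo=FALSE) is presumably of this common form; the name calc_error_rate is an assumption:

```r
# Misclassification rate: fraction of predictions that differ from the truth
calc_error_rate <- function(predicted.value, true.value) {
  mean(true.value != predicted.value)
}
```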
Train a decision tree and use cv.tree() to prune the resulting tree to minimize misclassification error. Be sure to use the folds from above for cross-validation. Visualize the trees before and after pruning. Save training and test errors to a records variable. Interpret and discuss the results of the decision tree analysis. Use this plot to tell a story about voting behavior in the US (remember the NYT infographic?). The first tree below is generated by using all of the regressors in the training data from the election data set to predict candidate outcome.
The second tree below shows the pruned version of the original tree, produced after using a 10-fold cross-validation approach to determine the tree size that minimizes classification error. The tree has been pruned to the size selected by that tuning procedure.
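The fit-then-prune procedure can be sketched as follows (assuming the tree package, with the built-in iris data as a stand-in for the election training set):

```r
library(tree)

# Fit an unpruned classification tree on a stand-in dataset
fit <- tree(Species ~ ., data = iris)

# Cross-validate over tree sizes, scoring by misclassification error
cv <- cv.tree(fit, FUN = prune.misclass)
best.size <- cv$size[which.min(cv$dev)]

# Prune back to the size selected by cross-validation
pruned <- prune.misclass(fit, best = best.size)
```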
Run a logistic regression to predict the winning candidate in each county, and save training and test errors to the records variable. What are the significant variables? Are these consistent with what you observed in the decision tree analysis? Interpret the meaning of a couple of the significant coefficients in terms of a unit change in the variables. Did your particular county (from question 13) results match the predicted results?
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
The most significant variables in predicting the winning candidate with a logistic model are Citizen, Professional, Service, Production, Drive, Employed, PrivateWork, and Unemployment, each with p ≈ 0. Other quite significant variables include White, IncomePerCap, Carpool, and FamilyWork with p < 0.001, as well as Office with p < 0.01 and Income with p < 0.05. Many of the variables are significant, which makes sense given that we have perfect separation in our result. In the decision tree, the main splitting variable was Transit, where the majority of nodes predicting Trump fell to the left side and the majority of nodes predicting Clinton fell to the right side. Transit, on the other hand, was not a significant predictor in the logistic model. White was used in the second tier of the decision tree and was a relatively strong predictor in the logistic model. CountyPop was used to make the second split on the right side of the tree but was not a significant predictor in the logistic model; this may be because, as the tree shows, only very large counties predicted Hillary while there was more variation among medium to small counties. Professional and Production were two variables that were consistent between the decision tree and logistic regression analyses. The Production variable, however, was an end node in the decision tree that resulted in two outputs of Trump and therefore was not very helpful for distinguishing between the candidates.
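The unit-change interpretation of the coefficients can be illustrated with a small logistic fit (mtcars as a toy stand-in, not the election data): each coefficient is the change in the log-odds of the outcome per one-unit increase in that predictor, holding the others fixed.

```r
# Toy logistic regression: transmission type as a function of weight and horsepower
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
coef(fit)

# exp(coef(fit)["wt"]) is the multiplicative change in the odds of am = 1
# for a one-unit increase in wt, holding hp fixed
```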
We received the warning glm.fit: fitted probabilities numerically 0 or 1 occurred. As we discussed in class, this is an indication that we have perfect separation (some linear combination of variables perfectly predicts the winner). This is usually a sign that we are overfitting. One way to control overfitting in logistic regression is through regularization. Use the cv.glmnet function from the glmnet library to run K-fold cross validation and select the best regularization parameter for the logistic regression under the LASSO penalty. Reminder: set alpha=1 to run LASSO. What are the non-zero coefficients in the LASSO regression for the optimal value of \(\lambda\)? How do they compare to the unpenalized logistic regression? Save training and test errors to the records variable.

| | Coefficient Estimate |
|---|---|
| (Intercept) | -2.717 |
| CountyPop | 0.000 |
| Men | 0.000 |
| White | -0.036 |
| Citizen | 3.738 |
| Income | 0.000 |
| IncomeErr | 0.000 |
| IncomePerCap | 0.000 |
| IncomePerCapErr | 0.000 |
| Poverty | 0.070 |
| ChildPoverty | -0.012 |
| Professional | 0.108 |
| Service | 0.117 |
| Office | 0.029 |
| Production | 0.046 |
| Drive | -0.056 |
| Carpool | -0.046 |
| Transit | 0.000 |
| OtherTransp | 0.019 |
| WorkAtHome | -0.027 |
The non-zero coefficients for the LASSO model are White, Citizen, Poverty, ChildPoverty, Professional, Service, Office, Production, Drive, Carpool, OtherTransp, and WorkAtHome.
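The CV-then-extract-coefficients procedure can be sketched as follows (assuming the glmnet package, with mtcars as a toy stand-in for the election design matrix):

```r
library(glmnet)
set.seed(1)

# Toy design matrix and binary response (stand-in data)
x <- model.matrix(am ~ . - 1, data = mtcars)
y <- factor(mtcars$am)

# K-fold CV over lambda; alpha = 1 selects the LASSO penalty
cv.fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

# Coefficients at the optimal lambda; rows shrunk to zero were dropped
coef(cv.fit, s = "lambda.min")
```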
| Training Error | Test Error | |
|---|---|---|
| tree | 0.0883550 | 0.0975610 |
| logistic | 0.0663681 | 0.0780488 |
| lasso | 0.0728827 | 0.0943089 |
Some possibilities for further exploration are:
Data preprocessing: we aggregated sub-county level data before performing classification. Would classification at the sub-county level before determining the winner perform better? What implicit assumptions are we making?
Exploring additional classification methods: KNN, LDA, QDA, SVM, random forest, boosting etc. (You may research and use methods beyond those covered in this course). How do these compare to logistic regression and the tree method?
trn.cl.new <- droplevels(trn.cl)
tst.cl.new <- droplevels(tst.cl)
bag.candidates = randomForest(candidate ~ ., data=trn.cl.new , mtry=6, importance=TRUE)
bag.candidates
##
## Call:
## randomForest(formula = candidate ~ ., data = trn.cl.new, mtry = 6, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 6
##
## OOB estimate of error rate: 5.99%
## Confusion matrix:
## Donald Trump Hillary Clinton class.error
## Donald Trump 2047 40 0.01916627
## Hillary Clinton 107 262 0.28997290
plot(bag.candidates)
legend("top", colnames(bag.candidates$err.rate),col=1:4,cex=0.8,fill=1:4)
yhat.bag = predict(bag.candidates, newdata = tst.cl.new)
# Confusion matrix
bag.err = table(pred = yhat.bag, truth = tst.cl.new$candidate)
test.bag.err = 1 - sum(diag(bag.err))/sum(bag.err)
test.bag.err
## [1] 0.05528455
The test set error rate of the bagged classification tree is 0.0553, an improvement of about 4.2 percentage points over the test error rate of the optimally pruned single tree (0.0976).
rf.candidates = randomForest(candidate ~ ., data=trn.cl.new , mtry=2, importance=TRUE)
rf.candidates
##
## Call:
## randomForest(formula = candidate ~ ., data = trn.cl.new, mtry = 2, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 6.72%
## Confusion matrix:
## Donald Trump Hillary Clinton class.error
## Donald Trump 2058 29 0.01389554
## Hillary Clinton 136 233 0.36856369
plot(rf.candidates)
yhat.rf <- predict(rf.candidates, newdata = tst.cl.new)
# Confusion matrix
rf.err <- table(pred = yhat.rf, truth = tst.cl.new$candidate)
test.rf.err <- 1 - sum(diag(rf.err)) / sum(rf.err)
test.rf.err
## [1] 0.06666667
While the overall test error differs little between bagging (0.0553) and the random forest (0.0667), the class error for Hillary Clinton worsens from 0.2812 under bagging to 0.3144 under the random forest, while the class error for Donald Trump improves slightly from 0.0192 to 0.0177.
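The class-specific test error rates quoted above can be recovered directly from the confusion matrices. A minimal sketch, assuming `bag.err` and `rf.err` are the pred-by-truth tables built earlier:

```r
# Per-class test error: for each true class, the fraction of test counties
# misclassified (1 minus the diagonal cell over that truth column's total)
class.error <- function(conf) {
  1 - diag(conf) / colSums(conf)
}
class.error(bag.err)  # class error for each candidate under bagging
class.error(rf.err)   # class error for each candidate under the random forest
```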
importance(bag.candidates)

Variable importance measures for the bagged model:

| Variable | Donald Trump | Hillary Clinton | MeanDecreaseAccuracy | MeanDecreaseGini |
|---|---|---|---|---|
| CountyPop | 10.09 | 10.16 | 15.03 | 13.26 |
| Men | 8.79 | 12.20 | 15.41 | 17.90 |
| White | 24.52 | 33.55 | 33.87 | 83.64 |
| Citizen | 14.31 | 6.78 | 16.68 | 16.17 |
| Income | 11.42 | 13.33 | 16.95 | 18.47 |
| IncomeErr | 2.99 | 10.32 | 10.88 | 14.62 |
| IncomePerCap | 13.38 | 14.52 | 18.68 | 21.89 |
| IncomePerCapErr | 1.69 | 14.07 | 13.99 | 15.24 |
| Poverty | 12.82 | 7.40 | 15.24 | 19.31 |
| ChildPoverty | 9.64 | 6.08 | 12.44 | 11.35 |
| Professional | 15.20 | 20.56 | 22.75 | 28.92 |
| Service | 10.49 | 11.25 | 15.32 | 16.52 |
| Office | 8.87 | 8.31 | 12.36 | 13.18 |
| Production | 7.35 | 13.88 | 15.78 | 19.17 |
| Drive | 13.17 | 11.86 | 17.77 | 15.57 |
| Carpool | 7.23 | 8.94 | 12.73 | 10.82 |
| Transit | 14.48 | 46.25 | 44.10 | 100.66 |
| OtherTransp | 2.50 | 12.60 | 11.23 | 13.19 |
| WorkAtHome | 9.22 | 10.42 | 13.18 | 9.93 |
| MeanCommute | 12.25 | 7.76 | 14.47 | 13.10 |
| Employed | 10.54 | 16.81 | 17.67 | 17.04 |
| PrivateWork | 11.59 | 3.95 | 12.41 | 11.01 |
| SelfEmployed | 10.40 | 12.27 | 15.43 | 15.86 |
| FamilyWork | 3.01 | 9.37 | 8.66 | 9.67 |
| Unemployment | 14.34 | 18.13 | 23.23 | 24.94 |
| Minority | 22.42 | 26.93 | 30.62 | 73.69 |
| weight | 0.00 | 0.00 | 0.00 | 0.00 |
importance(rf.candidates)

Variable importance measures for the random forest model:

| Variable | Donald Trump | Hillary Clinton | MeanDecreaseAccuracy | MeanDecreaseGini |
|---|---|---|---|---|
| CountyPop | 12.41 | 8.23 | 14.80 | 16.31 |
| Men | 5.70 | 13.07 | 12.81 | 20.63 |
| White | 20.09 | 27.15 | 26.79 | 59.01 |
| Citizen | 8.22 | 8.03 | 11.87 | 19.00 |
| Income | 11.27 | 13.65 | 17.94 | 23.18 |
| IncomeErr | 1.64 | 13.15 | 12.62 | 20.58 |
| IncomePerCap | 11.50 | 14.47 | 18.59 | 25.40 |
| IncomePerCapErr | -1.39 | 15.30 | 12.59 | 19.74 |
| Poverty | 14.61 | 8.49 | 17.86 | 22.98 |
| ChildPoverty | 9.89 | 7.92 | 13.66 | 19.58 |
| Professional | 12.37 | 19.44 | 19.40 | 28.04 |
| Service | 11.60 | 10.02 | 16.31 | 19.21 |
| Office | 8.60 | 7.02 | 11.14 | 15.02 |
| Production | 6.27 | 14.70 | 14.94 | 21.98 |
| Drive | 15.12 | 14.03 | 19.73 | 21.01 |
| Carpool | 7.09 | 5.90 | 9.15 | 12.99 |
| Transit | 12.68 | 29.00 | 28.89 | 61.92 |
| OtherTransp | 1.49 | 12.02 | 9.34 | 17.28 |
| WorkAtHome | 7.47 | 8.46 | 11.01 | 13.74 |
| MeanCommute | 10.27 | 5.48 | 11.41 | 14.59 |
| Employed | 7.77 | 18.11 | 16.98 | 19.52 |
| PrivateWork | 8.63 | 3.98 | 9.98 | 13.43 |
| SelfEmployed | 9.84 | 12.41 | 14.73 | 19.56 |
| FamilyWork | 1.72 | 12.50 | 9.89 | 14.14 |
| Unemployment | 13.08 | 15.92 | 19.46 | 25.37 |
| Minority | 20.33 | 26.67 | 27.37 | 58.30 |
| weight | 0.00 | 0.00 | 0.00 | 0.00 |
par(mfrow=c(2, 2))
varImpPlot(bag.candidates, sort=T, main="Variable Importance for bag.candidates", n.var=5)
varImpPlot(rf.candidates, sort=T, main="Variable Importance for rf.candidates", n.var=5)
Bootstrap: Perform the bootstrap to generate plots similar to ISLR Figures 4.10/4.11. Discuss the results.
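One way to approach this is to resample the training counties with replacement and track how the refit model's test error varies across bootstrap replicates. A minimal sketch, assuming `trn.cl.new` and `tst.cl.new` are the training and test sets built earlier:

```r
library(randomForest)

# Bootstrap sketch (hypothetical setup): refit on each resample and record
# the resulting misclassification rate on the fixed test set
set.seed(1)
B <- 100
boot.err <- replicate(B, {
  idx <- sample(nrow(trn.cl.new), replace = TRUE)
  fit <- randomForest(candidate ~ ., data = trn.cl.new[idx, ], mtry = 6)
  pred <- predict(fit, newdata = tst.cl.new)
  mean(pred != tst.cl.new$candidate)
})
hist(boot.err, main = "Bootstrap distribution of test error", xlab = "Test error")
```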
Use linear regression models to predict the total vote for each candidate by county. Compare and contrast these results with the classification models. Which do you prefer and why? How might they complement one another?
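A regression approach could look like the following sketch. The data frame `county.votes` and its columns `trump_votes` and `clinton_votes` are hypothetical names for county-level vote totals joined to the same census predictors:

```r
# Hypothetical: one linear model per candidate's county vote total,
# excluding the other candidate's total from the predictors
lm.trump   <- lm(trump_votes ~ . - clinton_votes, data = county.votes)
lm.clinton <- lm(clinton_votes ~ . - trump_votes, data = county.votes)

# Predicted winner: whichever fitted vote total is larger in each county;
# this converts the regression output into a classification for comparison
pred.winner <- ifelse(predict(lm.trump) > predict(lm.clinton),
                      "Donald Trump", "Hillary Clinton")
```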
Conduct an exploratory analysis of the “purple” counties, i.e., the counties that the models predict Clinton and Trump were roughly equally likely to win. What is it about these counties that makes them hard to predict?
Instead of using the native attributes (the original features), we can use principal components to create new (and lower dimensional) sets of features with which to train a classification model. This sometimes improves classification performance. Compare classifiers trained on the original features with those trained on PCA features.
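A sketch of the PCA approach, assuming the predictors in `trn.cl.new` are all numeric: fit principal components on the training features only, keep enough components to explain most of the variance, and project the test set with the training rotation.

```r
# Principal components computed from the training predictors only
pr.out <- prcomp(trn.cl.new[, names(trn.cl.new) != "candidate"],
                 center = TRUE, scale. = TRUE)

# Keep the smallest number of components explaining ~90% of the variance
k <- which(cumsum(pr.out$sdev^2) / sum(pr.out$sdev^2) >= 0.9)[1]

# Train a classifier (here logistic regression) on the PC scores
pc.train <- data.frame(candidate = trn.cl.new$candidate, pr.out$x[, 1:k])
glm.pca <- glm(candidate ~ ., data = pc.train, family = binomial)

# Test observations must be projected with the *training* rotation
pc.test <- predict(pr.out,
                   newdata = tst.cl.new[, names(tst.cl.new) != "candidate"])[, 1:k]
```

Comparing the test error of `glm.pca` against the logistic model on the original features would show whether the dimension reduction helps.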